Tandem repeats in giant archaeal Borg elements undergo rapid evolution and create new intrinsically disordered regions in proteins

Borgs are huge, linear extrachromosomal elements associated with anaerobic methane-oxidizing archaea. Striking features of Borg genomes are pervasive tandem direct repeat (TR) regions. Here, we present six new Borg genomes and investigate the characteristics of TRs in all ten complete Borg genomes. We find that TR regions are rapidly evolving, recently formed, arise independently, and are virtually absent in host Methanoperedens genomes. Flanking partial repeats and A-enriched character constrain the TR formation mechanism. TRs can be in intergenic regions, where they might serve as regulatory RNAs, or in open reading frames (ORFs). TRs in ORFs are under very strong selective pressure, leading to perfect amino acid TRs (aaTRs) that are commonly intrinsically disordered regions. Proteins with aaTRs are often extracellular or membrane proteins, and functionally similar or homologous proteins often have aaTRs composed of the same amino acids. We propose that Borg aaTR-proteins functionally diversify Methanoperedens and all TRs are crucial for specific Borg–host associations and possibly cospeciation.


Introduction
Metagenomics has led to the increasing discovery of microbial extrachromosomal elements (ECEs) from environmental samples [1,2]. This bears great importance, as it allows us to better understand evolutionary processes and functional roles of ECEs in natural systems and potentially use the ECEs, or elements of them, for genetic engineering. Using genome-resolved metagenomics, we recently discovered Borgs, which are unusually large ECEs. Based on gene content and co-occurrence patterns, Borgs associate with several species of anaerobic methanotrophic (ANME) archaea of the Methanoperedens genus [3]. A search for other ECEs associated with these archaea also led to the recent discovery of large plasmids found in members of the same genus, yet a distinct clade of Methanoperedens species [4]. PLOS  Rose Borg possesses the smallest (623,782 bp), and Green Borg possesses the largest genome (1,094,519 bp). As expected, based on the four previously reported genomes, five of the new genomes are linear, terminated by inverted repeats. Based on GC skew analysis, replication of Borg DNA is initiated at the chromosome ends (Figs 2 and S1 and S1 Table). The Red Borg genome contains a repeated sequence that prevented identification of a unique genome assembly path. The variant that generated the expected pattern of GC skew as for the other Borgs was chosen, completing the final set of ten Borg genomes. All genomes but the Green Borg genome are composed of a large and a small replichore; the Green Borg genome has a slightly more complex organization. Each replichore carries essentially all genes only on one strand. Consequently, there are no apparent transcriptional operons. This was mirrored by a low frequency of transcriptional regulators in Borg genomes. Specifically, we only found 0.35 transcriptional regulators per 100 kbp Borg genome, whereas near complete Methanoperedens genomes encoded 5.9 in 100 kbp (1 in Rose; 2 in Purple; 3 in Brown, Lilac, Sky, Green; 5 in Purple; 6 in Orange; and none in Ochre and Red Borg versus 53, 53, and 62 in three near-complete Methanoperedens genomes SRVP18_hole-7m-from-trench_1_20cm__Methanoperedens-related_44_31,RifSed_csp2_16ft_3_Methanoperedens_45_12,RifSed_csp1_19ft_2_-Methanoperedens_44_10; 2.77-2.90 Mbp). We also note that none of the Borgs encode a DNA-dependent RNA polymerase or a TATA-box binding protein.
The previously reported four completed Borg genomes and six new completed Borg genomes were used to evaluate the distribution and features of perfect TRs using a stringent threshold that allowed no mismatch (�50 nt region and �3 TR units). However, during the manual curation, we noticed that some regions had mapped sequencing reads with slight differences in the unit composition, usually a single nucleotide change. This is because reads  35 28 assigned to a specific Borg likely derive from thousands of genomes, many of which are slightly different to each other.
In the example of a repeat region in a MHC gene, there are two TR unit types that differ by a G ! A substitution SNP that is synonymous on the amino acid level. Interestingly, the mixture of TR variants can differ within a Borg population (S2 Fig).
To assess whether TR regions were particularly prone to mutation, and, specifically, if there is a bias toward certain SNPs, we analyzed the reads mapped to each Borg genome using inStrain [11]. We detected no SNPs in 3 genomes and extremely few SNPs in 6 genomes (2.2 SNPs across the cumulative TR regions of each genome). The exception was Lilac Borg, which harbored 17 SNPs of which 7 were within the same TR region within an MHC (S2 Table). These small numbers preclude any statistical analyses of different mutation incidences in the TR regions compared to the whole genome. Interestingly, we observed an A-bias in the SNPs found in TR regions, and this is primarily at the expense of G (14/18 cases). In contrast, A-SNPs found in non-TR regions were similarly at the cost of G (462/1057 cases) and C (494/ 1057 cases), but less frequently at the cost of T (101/1057 cases).

Regions with
TRs are fast evolving. The number of TR units in a region sometimes clearly differed within a single Borg population (Fig 3A). This implies that the Borg TR regions are fast evolving, much like CRISPR repeat-spacer inventories. Moreover, the repeat loci were rarely conserved in otherwise alignable regions of the most closely related Borgs and were absent from homologous proteins from the host or other homologues in NCBI or our own Reads mapped to a de novo assembly before and after curation. Reads are shown as grey bars below the consensus sequence. SNPs in reads are highlighted in color. The pink segments mark a unique sequence that was used to place misplaced reads correctly and the nucleotide repeat unit is shown as black segments. (A) A large number of SNPs indicates a local assembly error associated with a collapsed repeat region, resulting in a consensus sequence with one repeat unit. (B) The consensus sequence has three repeat units after curation.
https://doi.org/10.1371/journal.pbio.3001980.g001 database ggKbase. This is additional evidence that TR regions formed very recently, after Borgs diverged. A rare case where TR regions are similar occurs in Black and Brown Borgs, which are closely related based on genome alignments (Fig 3B). Close inspection of an approximately 7-kbp region in the aligned genomes revealed an 83% sequence identity and four different kinds of TRs. TR-1 is intergenic, is composed of a 20-nt unit repeated six times, followed by a nearly perfect additional unit of 21 nt, then another perfect unit. This intergenic TR is absent in Brown Borg. TR-2 is in an ORF and comprises 18 nt units that are identical in Black and Brown Borgs, where they occur seven and eight times, respectively. TR-3 is also in an ORF and comprises 21 nt units that occur consecutively six times in Black Borg. Brown Borg has two identical units in the same ORF, followed by a sequence that differs by one SNP, then another identical unit. TR-4 is also in an ORF and comprises 36 nt units that occur four times in Brown Borg, but these are absent in the same ORF from Black Borg. Interestingly, the nucleotide sequence from Brown Borg in the vicinity of TR-4 has high nucleotide-level similarity but no TRs.
To quantify the variation in TR unit number and compare it with the count of insertions/ deletions in different regions of the Borg genomes, we performed a case study using the genome of Ochre Borg with all reads mapped onto it. We found no instances throughout the entire genome of indels in well-mapped Ochre sequencing reads, except for in the TR regions. For this analysis, we included loci with only two repeat units and found that six of the 27 TR regions showed variation in the repeat unit number. Since many reads start or end within the TR region, they do not provide information regarding the total number of repeat units, hence the true incidence of variation is likely even higher.
2.1.3. TRs are flanked by partial repeats and are enriched in adenine. To constrain the mechanism behind TR formation in Borg genomes, we assessed characteristics of the TR sequences and their flanking DNA regions. Repetition of sequences with variation in GC versus AT content generally introduces GC/AT-symmetry in regions containing TR units (Fig 4   Fig 2. Borg genomes are composed of a large and a small replichore, and their replication is initiated at the termini. Shown is the genome architecture of Brown Borg, including terminal inverted repeats (white arrows), TR regions (red arrows), and GC composition. GC skew and cumulative GC skew show that the genome is replicated from the terminal inverted repeats (origin) to the terminus. The data underlying this Figure can be found in S1 Table. https://doi.org/10.1371/journal.pbio.3001980.g002 and S3 Table). Offset of the symmetric units and the TRs depends on the exact choice of the repeat unit. The TR regions can be preceded by sequences that are identical to the end of the TR unit and/or followed by sequences that are identical to the start of the TR unit (Fig 4A and  4B). These flanking partial repeats are often also adjacent to TR regions that contain only two repeats (and were excluded from statistics). Often the middle of the TR unit is not in either of the 5 0 or 3 0 partial repeats. When there are different choices for the repeat unit there are slightly different partial repeats that flank the TR sequences (Fig 4C).
The nucleotide composition of the ORF-carrying strands on both replichores was similar, with nucleotide frequencies of A>T>G>C (38.4%, 28.8%, 20.3%, 12.5%). The nucleotide composition of all TR units compared to this overall composition showed a strong bias towards A (48.4%), while all other nucleotides were depleted (T 23.7%, G 17.1%, C 10.8%) (S4 Table). This A-bias is analogous to the A-bias observed in SNPs within TR regions (S2 Table).
To establish that the differences in nucleotide composition of the TRs versus all nucleotides are significant, we performed a nonparametric Kruskal-Wallis test [12] and a false discovery The consensus sequence of the reads mapped to the curated genome contains three TR units (aqua segments). Reads are shown as grey bars below the consensus sequence ( ��� ). Six reads span this region perfectly ( �� ), five reads span this region but are missing one TR unit ( � , black segments), nine reads do not span the entire TR region. (B) Genome alignment of closely related Black Borg and Brown Borg. Genomic regions that align are depicted as the same-colored collinear blocks in the top panel. A 7-kb region (black box) shows four instances of TRs in these genomes: TR-1 is present in Black Borg but absent in Brown Borg; TR-2 is present in both Borg genomes but with different numbers of TR units; TR-3 is only present in Black Borg, but an imperfect version is present in Brown Borg; TR-4 is only present in Brown Borg. Same color segments show identical TR units; genes are depicted in grey.
https://doi.org/10.1371/journal.pbio.3001980.g003 rate (FDR) correction according to . This revealed that the compositional bias between the nucleotide TRs and non-TR sequences is of high statistical significance with corrected p-values of 9.78AU : Pleasenotethatscientificnotationsthroughoutthetexthavebeenreforma  Table). The fact that this compositional bias is observed in both genes, intergenic regions and in flanking partial repeats suggests that the A enrichment is important for the process that forms the repeats. Predicting the RNA secondary structure of the nucleotide TRs with RNAfold [14] revealed that some are predicted to form loops and others form hairpins.
We searched for polymerases in Borg genomes, given their potential relevance for repeat formation [15]. We found that all Borgs encode at least one DNA polymerase and phylogenetic placement within reference sequences [16] revealed that there are two types of PolBs encoded in the Borgs: the B2 clade and the B9 clade (S3 Fig). The DNA PolB9 encompasses a predicted 3 0 to 5 0 exonuclease and was found in each complete Borg genome. The amino acid sequences are very similar, suggesting a high mode of conservation for these particular proteins.

TRs are rare in Methanoperedens and introduce aaTRs in Borg ORFs.
To shed light on the role and function of the TRs, we surveyed their occurrence across all ten complete Borg genomes. We found 460 regions that make up 0.62% of the average Borg genome ( Table 1). Draft Methanoperedens metagenomes on the other hand only have 1 to 4 TR regions, making up �0.01% of the metagenomes (0.0099% in SRVP18_hole-7m-from-tren-ch_1_20cm__Methanoperedens-related_44_31, 0.0021% RifSed_csp2_16ft_3_Methanopere-dens_45_12, 0.0018% in RifSed_csp1_19ft_2_Methanoperedens_44_10), suggesting that TR formation is highly genome specific. Approximately half (43% to 65%) of the Borg TRs were located within ORFs. All TRs within ORFs had unit lengths that are divisible by three, so these repeats do not disrupt reading frames. They result in amino acid tandem repeats that we refer to as aaTRs (and aaTR-proteins). Only 14% to 38% of intergenic TR unit lengths were divisible by three (S4 Fig). Sometimes, several different TRs occurred within the same ORF, so the   Borg 13). Upon inspection of aaTR-proteins, we found a different codon usage in genes with TRs relative to all Borg genes. As expected, based on the relatively A-rich composition of repeat regions, the aaTR-bearing ORFs generally favored incorporation of codons containing A. Codons containing T in any position were generally depleted in aaTR regions. Codons with C in position 1 or 2 of the triplett, and G in position 1 were enriched in aaTRs, but codons with C or G in other positions or combinations of positions 1 to 3 were depleted (S5A Fig and S5 Table). The most frequent codons in aaTRs were CCA (encoding proline) and ACA (encoding threonine) with an up to 3-fold higher frequency in aaTRs than in non-aaTR regions. The most statistically enriched codons in TRs were GAT and AAA (encoding aspartate and lysine with corrected p-values of 1.63 × 10 −57 and 1.66 × 10 −52 ), and most statistically depleted codons in TRs were ATG and TTT (start codon and phenylalanine with corrected p-values of 6.43 × 10 −104 and 6.67 × 10 −100 ) (S5 Table). Seven codons were completely absent in aaTRs, namely, TGA, TAA, and TAG (stop codons), CGA, CGT, and CGG (encoding arginine), TTC (phenylalanine), and TGC (cysteine) (S5B Fig).

Amino acid frequency in aaTRs.
We closely examined the predicted biophysical and biochemical properties of the single repeat units from all ten Borg genomes and 37 additional proteins from curated Borg contigs. This final dataset comprised 215 Borg aaTR-proteins and 306 repeat regions (S6 and S7 Tables). Proline, threonine, glutamate, and lysine were particularly enriched across all Borgs, while tryptophan, cysteine, and phenylalanine were almost absent (Fig 5A and S8 Table). We noticed that disorder-promoting amino acids were enriched in the aaTRs and order-promoting amino acids were depleted (S6 Fig).
Hierarchical clustering of triple aaTR units revealed 28 aaTR clusters, each cluster composed of aaTRs with very similar amino acid composition and frequency (Fig 5B and S7  Table). Many aaTR units possessed nearly equal numbers of the charged amino acids K and E (polyampholyte repeat cluster); others were enriched in P/T/S, which are potential sites of posttranslational modification. From this, we conclude that there are distinct and related groups of aaTRs. To further investigate the observation that aaTRs particularly often contain disorder-promoting amino acids, we assessed the prominence of intrinsically disordered regions (IDRs) in aaTR-proteins versus all Borg proteins. IDRs are polypeptide segments that are characterized by a lack of a well-defined 3D structure [17]. Remarkably, 62.8% aaTR-proteins contained at least one IDR (�15 amino acids), whereas only 5.6% of all Borg proteins contained an IDR. The relative length of the IDR regions varied from 2.6% to 100% and 1.2% to 100% in both aaTR-proteins and non-aaTR-proteins (S7 Fig and S9 Table). In aaTR-proteins, the IDRs   TP TP VP TA  TP TP VP TA  TP TP VP   TA   TP  TP  IP TA   PT  PV  PT  AT   PT  PV  PT  AT   PV  PT  AT  PT   TP  TP  VP  TP   TP  VP  TD  TP   PV  PT  D TP   TP  VP  TD  TP  V   P V P TD   TP  T   TP  TP  V  P  TD   TP  TP  IP  TD   P  T  D  T  P  T  P  I   T  P  T  P  IA  T  P   T  P  T  P  IA  T T P T  P IA  T P T  TP TP IA  LI TP  TP IS TP TP   TI   LI TP TP   IS TP  TP  TI   TP  TP VP  D PT  A   PT PE  TA  TP  TY  TA  TP  TY  TP TY TP  TP  TY  YT  PT  PT ST AT  AT AT PIP I  PN VT GT PL  PN VT GT PL  TP LP NV TE  TP LP NV TE  PN VTA T  TPN VTE  VPK PNV TV  NPT V  VTIS PTP N  ANPT  TPNV TVVP TAIA  PNVTA EIT  TSNV  SVSISGT P  VPTPEPT  TPSPIIITP  IITPTPSPV  PTSTLTAS  TVTEPTPTV  VTAVETPT  TVNETPVV  TTVAPN  APNTTV  PKTDT TTAS  NIPTV TPTV  PTKI IVVN TPTV T  TPV IPTT NI  IEPT NYP TSV PT  PVQ PVA QT  QPP  IAQ TPI   PVA PPI VQ TPV   P   PIA PIQ T  IVP TIP P  ITE KP  KV VA E  PIE  PIP  EIT  SE PIE PI   PT ET VV P  PT  NT  TP  V   TT  SQ  PT  VV  K   TN IT S  AT  NV TS DN   NS  TS KN VT   KI NN TK   L   D TD IT D TK D VT   almost always corresponded to the TR region, and Methanoperedens homologues did not have IDRs (Fig 6A). Thus, we hypothesize that the TRs in ORFs mostly lead to the creation of new IDRs in existing proteins. Forty proteins possessed multiple repeat regions ranging from two to five aaTR regions (S10 Table), and the aaTRs were often located at the N-or C-termini or between domains. Some of these regions fall into the same or a similar aaTR cluster, whereas others are distinct and often located in a completely different region of the polypeptide chain. In 22 cases, the aaTRs make up the majority (50% to 96%) of the predicted proteins, and many of these are small proteins (14 proteins are <100 aa) (S11 Table). The repeat units of these proteins are diverse, but clusters enriched in K, E, and I (cluster 10 and 11) were particularly frequent (6/ 18). Although it is possible that these are wrongly predicted genes in intergenic regions, they may also be de novo repeat peptides. Their existence as real proteins is supported by the observation that they possess a start codon (or alternative start codon), the TR units are divisible by three and thus introduce aaTRs, and by the fact that four have signal peptides and four have functional annotations (linked to apolipoprotein, proline-rich domains, and a glycoprotein domain).

Functions of aaTR-proteins 2.3.1. Similar aaTRs fulfill similar predicted functions, and functionally related proteins have similar aaTRs.
To investigate which functions aaTR-proteins have, and if functionally related proteins possess similar aaTRs, we performed protein family clustering of 11,995 Borg proteins. In brief, we clustered proteins using the fast and sensitive protein sequence searching software MMseqs2 [18] in an all-versus-all approach to define protein subclusters based on amino acid similarities. Based on the alignments of the subfamily members, HMMs were generated for each subfamily using HHblits [19]. These HMMs were then functionally annotated by an HMM-HMM comparison with the PFAM database using HHSearch [20]. This resulted in 85% of Borg proteins being grouped into 1,890 subfamilies and 80% (172/215) of the aaTR-proteins clustered into 112 subfamilies (S6 Table). Based on the annotations for the individual protein members of each subfamily and the annotation of the HMM of the subfamily, protein subfamilies were manually grouped in terms of function. The functional landscape of the subfamilies ranged from carbon metabolism (18 proteins), cell and protein architecture and scaffolding (9), nucleotide processing (15), transcription-associated proteins (8), redox (8), signaling (11), transport (3), stress response (2), to protein processing (1) ( Table 2).
Screening the aaTR-proteins with the eukaryotic linear motif (ELM) resource for functional sites in proteins (http://elm.eu.org/) unmasked that the aaTRs within them have similarity to a plethora of different short linear motifs (SLiMs). SLiMs are composed of three to ten consecutive amino acids [21], which are used by eukaryotic cells as cleavage sites, degradation sites, docking sites, ligand binding sites, posttranslational modification sites, and targeting sites [22]. We found that aaTRs from the same aaTR cluster or related aaTR clusters often had matching functional predictions (see examples in S7 Table).
Fifteen protein subfamilies were enriched in aaTR-proteins, which included subfamilies functionally annotated as transmembrane phosphatases, ribonucleoproteins, phosphoesterases, zinc-ribbon proteins, and DNA-binding proteins; the other subfamilies had no functional annotations (S6 Table). Members of the same protein subfamily often had similar aaTRs. For example, a subfamily of DNA binding proteins (subfam0897) only comprises aaTR-bearing members, all of which have aaTRs that are unique in nucleotide sequence (S8 Fig), yet of the same or related repeat clusters (Fig 6B). These aaTRs form a predicted coil  Domain architecture of S-layer protein from Green Borg. The aaTR region is intrinsically disordered (predicted with IUPred3; [28]), and the IDR regions are predicted binding sites for proteins, DNA, RNA, or a linker region (predicted with flDPnn [60]; probability of feature increases with color intensity). (B) All or most members of two protein subfamilies have aaTRs. The aaTRs within a protein subfamily are from one or two related aaTR clusters. The amino acid sequence of each aaTR unit is shown in brackets.
https://doi.org/10.1371/journal.pbio.3001980.g006 Table 2. Functional annotations for protein subfamilies with aaTR-bearing protein members. Proteins were manually placed into functional categories based on the pfam annotation of the subfamily. Single aaTR units are listed as well as in which aaTR cluster they fall. Some aaTRs did not fall into clusters (n.c., no cluster), and some proteins have two or more aaTRs, which fall into one or two aaTR clusters.  structure or an IDR and resemble SLiMs that play a role in ligand binding, degradation, and targeting (S7 Table). Similarly, subfam1609 comprises eight members, five of which are aaTRproteins, and, despite all being only distant homologues (highest aa identity is 67%), the aaTRs are encoded by unique nucleotide TR units (S8 Fig) but belong to the same aaTR cluster (SVG/IQ), which corresponds to predicted modification (phosphorylation) and ligand-binding sites (Fig 6B and S7 Table).

aaTR-proteins responsible for cell integrity/stability and surface decoration.
Several functionally related but phylogenetically unrelated proteins that are responsible for cell wall architecture (PEGA and S-layer) and decoration (glycosyltransferases and glycosyl hydrolases) have aaTRs. The aaTRs of the PEGA proteins resemble predicted ligand binding, docking, and modification (phosphorylation) sites, and the aaTRs of the S-layer resemble proteolysis-initiating degrons [23]. The aaTRs of glycosyl hydrolases resemble modification and ligand binding sites, and the aaTRs of glycosyl transferases resemble degrons (S7 Table). Some Borgs also possess an aaTR in tubulins, which are proteins required for cell division [24]. These aaTRs resemble SLiMs that are potential docking and cleavage sites and/or initiate proteasomal degradation. Importantly, these SliM-bearing aaTRs are absent in non-Borg homologues.

Case studies of aaTR-proteins: Ribonucleoproteins, MHCs, and a conserved TR hotspot
To further investigate proteins with the same function that evolve similar but not identical aaTRs, we performed in-depth analyses of two distinct types of proteins with a known function and a conserved region across Borgs comprising multiple aaTR-proteins.

Ribonucleoproteins.
Most Borgs encode Sm ribonucleoproteins, which are archaeal homologues of bacterial Hfq and eukaryotic Sm/Sm-like (Lsm) proteins. These proteins are implicated in versatile functions such as RNA-processing and stability, and the protein monomers form a stable hexameric or heptameric ring-shaped particle [25,26]. We found 19 Borg Sm ribonucleoproteins (S12 Table), five possess aaTRs, and one additional sequence from Grey Borg has a near-perfect aaTR. The aaTRs are always located at the N-terminus, preserving the conserved Sm1 and Sm2 RNA-binding motifs (S9A Fig). The aaTR units contain 4, 8, 12, or 17 amino acids, have the same aa character, and are predicted SLiMs that facilitate docking or resemble degrons (S9B Fig and S12 Table). An initial homology-based structural search of the aaTR-Sm from Black Borg identified the bacterial Hfq of Pseudomonas aeruginosa (PDB: 4MML, aa identity: 26%) that forms a homohexameric ring structure. Due to no alignable template in the database for the aaTR, the model failed to predict a structure of this N-terminal region. Remodeling with AlphaFold2 predicted the structure of the Sm core that formed a hexameric ring with long loops extending at the distal side of the protein (Fig 7). This unstructured region corresponds to the aaTR region and matches the prediction of MobiDBLite [27] and IUPred3 [28] that the aaTR is an IDR. Modeling of the other four aaTR-Sm proteins also showed N-terminal unstructured extensions corresponding to the aaTRs, which were similarly predicted IDRs (S9C Fig). 2.4.2. Extracellular electron transferring MHCs. All Borgs encode MHCs, which are particularly important and abundant in Methanoperedens as they mediate the final electron transfer from methane metabolism to an external electron acceptor [29]. We found 14 MHCs with aaTRs in Borg genomes. All of these are predicted to be located extracellularly, some possess a membrane anchor, and they range from 20.5 kDa (190 aa) to 137.5 kDa (1,293 aa) and contain up to 30 heme binding sites (Fig 7). The proteins are thus clearly distinct in sequence and domain architecture. The aaTRs are located at different sites in the polypeptides but never disrupt existing functional domains. Yet remarkably, the aaTRs belong to identical or similar repeat clusters that are consistently enriched in T and P and resemble ligand binding, docking, and modification sites (S13 Table). The aaTRs are mostly IDRs, which are predicted to form unstructured extensions that protrude from the folded protein core.

A conserved aaTR hotspot.
There is a region in all complete Borg genomes that is a hotspot for TRs with up to five aaTR-bearing proteins (Fig 8A). It includes a gene encoding subfam1773, which we refer to as cell envelope integrity protein TolA. Best blastp and structural hits are TolA from, e.g., the pathogenic bacterium Leptospira sarikeiensis (48% aa identity with WP_167882360) and YgfJ from the pathogen Salmonella typhimurium (PDB: 2JRP). Eight TolA proteins have unique aaTRs, and some carry additional aaTR units shared by other Borgs (e.g., Ochre Borg repeat units are found in Rose Borg and Purple Borg) (Fig 8B)  aaTRs are similar in amino acid composition and they are predominantly predicted degradation, docking, and ligand binding sites (S6 and S7 Tables). The regions encode two other conserved protein subfamilies with aaTR-bearing members: subfam0649 lacking functional annotation and a subfamily of zinc-ribbon proteins (subfam0108), which usually form interaction modules with nucleic acids, proteins, or metabolites. The subfam0108 proteins possess seven distinct aaTR units, six of which fall into the same repeat cluster (S6 and S7 Tables). Additional aaTR-bearing proteins in the same context are a glycogen synthase/glycosyl transferase (subfam0184) in Black Borg and Green Borg and proteins of unknown function (sub-fam1382) in Lilac Borg and Orange Borg, and a hypothetical protein in Green Borg.

Discussion
Tandem nucleotide repeats in Borg genomes likely are formed and evolve independently, as evidenced by the essentially unique sequences of TR units in each region within and among Borg genomes. They evolve rapidly, based on differences in the alignable regions of the most closely related Borgs (e.g., Black and Brown Borgs) and variability in repeat numbers within a single Borg population. This parallels the rapid evolution of eukaryotic variable number TRs (VNTRs). VNTRs are widespread and linked to neuropathological disorders (caused by the accumulation of short TRs) [30], gene silencing [31], and rapid morphological variation [32]. In bacterial genomes, highly repeated sequences are less common, and TR variations are linked to immune evasion, cell-pathogen specificity (definition of which cells/tissues/host pathogens/viruses can infect), and stress tolerance [33,34]. They are relatively common in a few archaeal genomes, but the functions there remain uncertain.
Previously proposed mechanisms for the expansion and retraction of VNTRs implicate various DNA transactions, including replication, transcription, repair, and recombination [15]. Replication slippage is one explanation for the origin and evolution of repeat loci [33]. This mechanism involves a pausing of the DNA polymerase and dissociation from the TR region, followed by DNA reannealing. Realignment of the newly synthesized strand can be out-of-register on the template strand, leading to propagation/retraction of the TRs. At this time, it is unclear whether the TRs are introduced by Borg or Methanoperedens machinery. If the former, it is notable that we found that all ten Borgs encode a highly conserved DNA PolB9, a clade of uncharacterized polymerases that have only been reported from metagenomic datasets [16]. Archaeal DNA polymerases B and D have been shown to slip during replication of TR ssDNA regions [35]. Thus, the Borg PolB9 could be responsible for TR propagation, possibly triggered by noncanonical secondary structures formed by the TR DNA or its upstream region [35].
Given the shared nucleotide characteristics of genic and intergenic TR regions (perfection, A bias), we infer that they form by the same mechanism. Even if TRs are introduced perfectly, one would expect some of them to have accumulated mutations, unless perfection is strongly selected for and ensured through a repair mechanism, or unless they all formed very recently. The latter is consistent with other evidence that TR regions are fast evolving, namely the variability in TR unit number within a Borg population.
In many cases, the sequences that flank the TR regions are partial repeats. We suspect that these are seed sequences that were used to initiate TR formation. If these regions were the seeds that gave rise to the TRs, the fact that they are often only parts of the repeat raises a mystery regarding the origin of central regions of these TRs. It is possible that the sequences that served as seeds were subjected to elevated mutation rates, which is supported by the observation of high SNP levels flanking TR regions in the human genome [25,27].
CRISPR repeats suggest a possible alternative to the replication slippage explanation for the origin of TR regions. Like TR regions, CRISPR loci are fast evolving, in that case motivated by the need to counter rapid evolution of bacteriophages to outwit spacer-based immunity [36]. CRISPR repeats are introduced by an integrase that excises the previously added repeat, ligates it to a new spacer, and adds that unit to the expanding locus, filling gaps to recover a double stranded sequence [37]. If a Borg system is involved, the genomes encode many and various nucleases and recombinases, some of which may be responsible for repeat addition. Unfortunately, a complete Methanoperedens genome is not available yet, so we cannot confidently assess their capacities for repeat introduction.
As noted above, our analyses revealed a compositional bias in nucleotide TRs towards A on the coding strand of each replichore, primarily at the expense of T. If this reflects mutation, the transversion (conversion of purine to pyrimidine) must happen at the biogenesis of the TR template, as the TRs are faithfully copy-pasted error-free. We tentatively support the alternative explanation that the A-rich nature of the seeds initiated TR formation. A small region that is A-rich, possibly functioning in a manner somewhat analogous to a protospacer adjacent motif (PAM) that binds the CRISPR system nuclease, may localize the machinery involved in repeat formation. We speculate that TR formation could be regulated by distinct methylation patterns of Borg DNA, possibly related to the A-rich character of the seed.
An analysis of the SNPs in TR regions provides another clue regarding A-enrichment in the TRs. Although the small data size precludes statistical analysis, we note that the reads with SNPs are more A-rich than the consensus repeat sequences, and this enrichment is apparently at the expense of G. This could suggest that during Borg genome replication or TR propagation, guanines in TRs are selectively mutated to adenines. This may be reminiscent of G to A transitions observed in human diseases at sites of 5-methylcytosine [38], but much larger datasets will be required to investigate this further.
A key question relates to how stable coexistence of Borgs, with their huge genomes, likely in multicopy [3], and the host Methanoperedens cells is achieved. Although it is possible that Borgs direct the functioning and evolution of the hosts upon which they fundamentally depend on for their existence, it is perhaps more likely that Methanoperedens controls Borg activity. The mechanisms for regulation of Borg gene expression are unclear, given the absence of obvious operons and the conspicuous lack of transcriptional regulators. As we identified no RNA polymerase and TATA-box binding proteins in any of the ten complete Borg genomes, it is possible that Methanoperedens can tightly regulate Borg gene transcription, thus controlling the production and localization of Borg proteins. TRs may be involved in this process. Approximately half of the TR regions were intergenic or have the first unit within or partially within gene ends. These could be (long) noncoding RNAs ((l)ncRNAs), which are known to modulate chromosome structure and function, transcription of neighboring and distant genes, affect RNA stability and translation and serve as architectural RNAs (arcRNAs) [39]. arcRNAs are of particular interest as they are involved in forming RNA-protein condensates, and some arcR-NAs possess repeat sequences to accumulate multiple copies of specific proteins and/or multiple copies of RNAs [40].
Virtually all TRs in ORF were of lengths divisible by three, thus do not disrupt ORFs, and their A bias was not as strong as intergenic TRs. This indicates a strong selective pressure to always introduce amino acid repeats and a selection for (or deselection against) specific codons. Bulky or highly reactive amino acids such as cysteine, which could severely interfere with redox homeostasis, were absent. Yet, disorder-promoting amino acids were enriched in aaTRs, and, concomitantly, most aaTR regions were predicted to be IDRs. Thus, we infer that aaTRs could impact protein functioning. While protein homologues from Methanoperedens (or other microbial genomes) almost never showed aaTRs, Borg homologues (or functionally related Borg proteins) had aaTRs that were highly similar in aa character, but not identical in DNA sequence. For very closely related Borgs, this may arise because similar DNA regions in Borg genes were seeds for TR formation. However, the presence of aaTRs in different regions of proteins with the same function (e.g., MHCs) indicates a strong selective pressure for specific aaTRs that is constrained by protein function. Overall, we consider the combination of features to be strong evidence that the aaTRs are not random genomic aberrations and those in ORFs can fulfill an important role in protein functioning.
To shed light on the predicted function that the proteins gain by having evolved aaTRs, we screened many of aaTRs for predicted functional sites. Most aaTRs are IDRs, and these are commonly found in SLiMs and vice versa [17]. In the case of the N-terminal aaTRs in the Borg ribonucleoproteins, the long loops that extend from the RNA-binding protein core could serve as docking sites for interaction partners. This is reminiscent of eukaryotic homologues that possess fused C-terminal IDR domains and are part of the spliceosome, a ribonucleoprotein machine that excises introns from eukaryotic pre-mRNAs [41]. The aaTR extension of the Borg ribonucleoproteins may undergo a disorder-to-order transition induced by binding to a yet to be identified protein partner, which could redefine its function in, e.g., pre-mRNA or pre-tRNA processing [42]. The aaTRs associated with MHCs were also predicted to form long, largely unstructured extensions. Their consistent compositional dominance by proline and threonine clearly suggests a shared function and convergent evolution of these aaTRs. Based on the predicted SLiMs within the aaTRs, they could be intraprotein intersections that are glycosylated or phosphorylated, triggering a signaling cascade and conformational changes that lead to a modified electron flow, redox capacity, and, possibly, nature of electron acceptor. Since intrinsic disorder is a common feature of (eukaryotic) hub proteins [43], the aaTRs could also be involved in navigating interprotein interfaces by being docking sites for other MHCs or extracellular oxidoreductases, such as a manganese oxidase that we found in Lilac Borg [3]. We thus propose that the Borg MHCs expand the redox and metabolic capacity of a large MHC network at the cell surface of the host [44] and aaTRs within them may enable a tunable connection to the electron conduction system that is integral to Methanoperedens' metabolism.
We found that aaTRs were statistically enriched in Borg proteins with predicted extracellular and membrane localization, another indication that their formation is not a random event. These proteins are typically implicated in cell-cell interactions, transport, protection, and virulence. It is unclear whether Borgs have an existence outside of Methanoperedens cells, but the inferred high copy number of some Borgs compared to host cells [3] may point to this possibility. Borg-encoded S-layer and PEGA domain proteins (with and without aaTRs) could potentially encapsulate Borgs and mimic host proteins to evade host defense. Alternatively (or at a different time in their existence), these Borg proteins could be displayed on the Methanoperedens cell surface to alter host processes.
If Borgs can exist outside of host cells, they would need the ability to infect Methanoperedens cells to replicate (analogous to viruses). Proteins involved in infection and defense are expected to be fast evolving, and TRs may serve as a mechanism to enable this. A hotspot for TR evolution that is conserved across Borgs encodes TolA (and additional conserved hypothetical proteins). TolA is important for cell envelope integrity and is crucial for entry of filamentous bacteriophages that infect Escherichia coli or Vibrio cholerae [45,46]. TolA is exceptional, as unlike other Borg aaTR-proteins, homologous proteins from Streptomyces and Bacillus are also repeat proteins with similar aa signatures (EAKQ). The observation that the TolA-containing regions in Borgs are conserved, fast evolving, and under strong selective pressure could be consistent with a role in Borg-host attachment. As proteins produced in the host, they may facilitate cell-cell interaction (e.g., for lateral gene transfer).
A striking observation from prior work at the wetland site is the large diversity of different and coexisting Borgs and Methanoperedens species [3]. Based on correlation analyses, different Borgs associate with different Methanoperedens [3], implying coevolution of Borgs and hosts. Lateral gene transfer has been reported as a driver for metabolic flexibility in members of the Methanoperedenaceae [47]. Gene acquisition (lateral gene transfer) is a prominent capacity of Borgs, which gave rise to their name and may contribute to Methanoperedens speciation. TR regions also may be key for establishing new Borg-host associations, especially if the aaTRs and noncoding TRs enable functional cooperation between Borg and host protein inventories. Thus, while CRISPR repeats evolve rapidly to defend against phage infection, rapid evolution of Borg TRs may be required to maintain coexistence with their hosts during coevolution.

Conclusions
We conclude that the nucleotide regions flanking repeats, and the uniqueness of the TRs in each locus, likely indicate that TRs arise from local sequences rather than being introduced from external templates. Other constraints on the mechanism behind TR formation are their generally perfect sequence repetition and A-enriched composition. Many TRs lead to aaTRs in proteins, and these are usually IDRs that are also predicted posttranslational modification sites and protein or nucleic acid binding sites. We propose that aaTR-proteins expand and modify the cellular and metabolic capacity of Borg-bearing Methanoperedens, yet expression of their large gene inventories is likely under tight control. Introduction of TRs in both genes and intergenic regions may be central to regulating Borg gene expression, translation, protein localization, and function. TR regions change rapidly in number and distribution to generate within population heterogeneity and between population diversity. This feature is likely central for Borg infection, association, and cospeciation of Borgs and their Methanoperedens hosts.
Manual curation of Borg genomes was performed in Geneious Prime 2021.2.2 (https:// www.geneious.com). It involved piecing together and verifying by paired reads placement, fragments of approximately the same GC content, sequencing reads coverage, phylogenetic profile, and relatedness to known Borgs into a single chromosome. Subsequently, careful visualization of the patterns of read discrepancies was used to locate local assembly errors, most of which were fixed by either relocating paired reads or introducing previously unplaced paired reads.

Genome visualization and alignments
Genomes were visualized in Geneious Prime 2021.2.2 and aligned with the MCM algorithm, or the progressiveMauve algorithm when aligning multiple contigs. Genetic neighborhood comparisons were performed using clinker [48] (v0.0.21).

GC skew analysis
Replichores were predicted by calculating the GC skew (G − C/G + C) and cumulative GC skew using the iRep package (gc_skew.py) [49].

Statistical analysis
To assess the significance of the observed compositional bias between the nucleotide TRs and non-TR sequences, we performed a Kruskal-Wallis test [12]. This included separating DNA sequences into repeat and non-repeat segments and dividing these into 50 bp substrings (resulting in a total of 803 substrings from repeat DNA and 9,430 from non-repeat DNA). The incidences of each nucleotide were counted on each substring, and the values were placed into 4-member nucleotide frequency vectors [N A , N T , N C , N G ]. Nucleotide frequency vectors for each 50 bp substring were then grouped into repeat and non-repeat categories and used as input to the Kruskal-Wallis test, which was implemented in SciPy (v. 1.9.0). The values were then further corrected by performing an FDR correction according to Benjamini-Yekutieli [13].

Secondary structure prediction
The secondary structure of TRs was predicted with RNAfold WebServer [14].

Protein family clustering
A dataset of 11,995 Borg proteins was constructed using all proteins from the ten curated Borg genomes and 37 additional aaTR-proteins from curated contigs of Pink, Blue, Steel, Olive, Grey, and Apricot Borg. All proteins were clustered using the fast and sensitive protein sequence searching software MMseqs2 in an all-versus-all search using sensitivity 7.5, cover 0.5, e-value 0.001 [18] (v7e2840992948ee89dcc336522dc98a74fe0adf00). The sequences of each member within a protein subfamily were aligned using result2msa of MMseqs2 and used as input to construct HMMs for each subfamily using HHblits [19]. The HMMs were then profiled against the PFAM database by HMM-HMM comparison using HHsearch [20], and protein subfamilies enriched in plasmid proteins were determined as described previously [54].

Hierarchical clustering of aaTRs and construction of a phylogenetic tree
Hierarchical clustering of aaTRs was performed using triple TR units for the alignments, since three was the minimum repeat unit length set as threshold, and the regions were dynamic. The alignments were performed with MAFFT [61] (v7.453) (-treeout-reorder-localpair). The aaTR clusters were visualized and decorated in iTOL [62]. An aaTR cluster was formed when a branch contained 3� related sequences. The names of the clusters were given based on the most abundant amino acids found in the repeat units (amino acids represented by 10%� became name-giving).

Structural modeling
Structural modeling of aaTR-bearing Sm proteins was initially performed using SWISS-MO-DEL [65] and the best hit in the Swiss-Prot database as template. Further structural modeling of aaTR-bearing Sm proteins was performed with AlphaFold2 using ColabFold [66] and selecting the experimental option homooligomer 6. Structural modeling of aaTR-MHC proteins was performed using AlphaFold2 [67] via a LocalColabFold (-use_ptm-use_turbo-num_relax Top5-max_recycle 3) [68,69]. Modeled protein structures were visualized and superimposed onto PDB structures using PyMOL [70] (v2.3.4).  [27] (threshold: �15 consecutive residues). aaTR or IDR regions were divided by the full protein length to calculate the relative length. The localization of aaTR and IDRs was calculated by dividing the mean coordinate for each region by the full sequence length. A total of 178 Borg aaTR-proteins had 220 aaTR regions (blue), and 112/178 aaTR-proteins had IDRs (green). A total of 557 Borg non-aaTR proteins had an IDR (yellow). The data underlying this Figure can be found in S9 Table. (TIF)  Table. GC skew and cumulative GC skew analyses of Borg genomes. Table accompanies Figs 2 and S1. GC skew was calculated using the iRep package (Brown and colleagues, 2016) [1]. (XLSX) S2 Table. Single nucleotide variation analyses of TR regions in Borg genomes. Analysis was performed using inStrain (Olm and colleagues, 2021) [2]. SNV, single nucleotide variant, which is equivalent to SNP. (XLSX) S3 Table. GC/AT-symmetry in regions containing TR units. Table accompanies Fig 4. The GC content was calculated in Geneious. Detailed information on each TR region can be found in S14 Table. (XLSX) S4 Table. Nucleotide composition of Borg genomes and TRs (in %). The nucleotide composition was calculated for the coding strand of each replichore. Subsequently, a sum of both values was formed and divided by the genome length and multiplied by 100%. (XLSX) S5 Table. A. Positional nucleotide frequency in aaTR regions and non-aaTR regions of ORFs.  Table. Sm ribonucleoproteins in Borg genomes. Features of aaTRs were predicted using the eukaryotic linear motif (ELM) resource for functional sites in proteins (http://elm.eu.org/). (XLSX) S13 Table. Multiheme cytochromes in Borg genomes. Features of aaTRs were predicted using the eukaryotic linear motif (ELM) resource for functional sites in proteins (http://elm. eu.org/). (XLSX) S14 Table. Statistics on TR regions in each Borg genome. TR regions were identified using custom code available on GitHub (https://github.com/rohansachdeva/tandem_repeats). (XLSX) S15 Table. Statistics on TR regions per Borg replichore. TR regions were identified using custom code available on GitHub (https://github.com/rohansachdeva/tandem_repeats). (XLSX)